Abstract

We propose a novel hybrid loss for multiclass and structured prediction prob- 
lems that is a convex combination of log loss for Conditional Random Fields 
(CRFs) and a multiclass hinge loss for Support Vector Machines (SVMs). We 
provide a sufficient condition for when the hybrid loss is Fisher consistent for clas- 
sification. This condition depends on a measure of dominance between labels - 
specifically, the gap in per observation probabilities between the most likely la- 
bels. We also prove Fisher consistency is necessary for parametric consistency 
when learning models such as CRFs. 

We demonstrate empirically that the hybrid loss typically performs at least as 
well as - and often better than - both of its constituent losses on a variety of tasks. In 
doing so we also provide an empirical comparison of the efficacy of probabilistic 
and margin based approaches to multiclass and structured prediction and the effects 
of label dominance on these results. 

1 Introduction 

Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) can be seen 
as representative of two different approaches to classification problems. The former 
is purely probabilistic - the conditional probability of classes given each observation 
is explicitly modelled - while the latter is purely discriminative - classification 
is performed without any attempt to model probabilities. Both approaches have their 
strengths and weaknesses. CRFs [?] are known to yield the Bayes optimal solution 
asymptotically but often require a large number of training examples to model the 
probabilities accurately. In contrast, SVMs make more efficient use of training examples but are 
known to be inconsistent when there are more than two classes [13, 8]. 

Despite their differences, CRFs and SVMs appear very similar when viewed as 
optimisation problems. The most salient difference is the loss used by each: CRFs 
are trained using a log loss while SVMs typically use a hinge loss. In an attempt to 
capitalise on their relative strengths and avoid their weaknesses, we propose a novel 
hybrid loss which "blends" the two losses. After some background (§2) we provide the 
following analysis: We argue that Fisher Consistency for Classification (FCC) - a.k.a. 
classification calibration - is too coarse a notion and introduce a distribution-dependent 
refinement called Conditional Fisher Consistency for Classification (§3). We prove the 
hybrid loss is conditionally FCC and give a noise condition that relates the hybrid loss's 
mixture parameter to a margin-like property of the data distribution (§3.1). We then 
show that, although FCC is effectively a non-parametric condition, it is also a necessary 
condition for consistent risk minimisation using parametric models (§3.2). Finally, we 
empirically test the hybrid loss on various domains including multiclass classification, 
Chunking and Named Entity Recognition and show it consistently performs better than 
either of its constituent losses (§4). 



2 Losses for Multiclass Prediction 

In classification problems observations $x \in \mathcal{X}$ are paired with labels $y \in \mathcal{Y}$ via some 
joint distribution $D$ over $\mathcal{X} \times \mathcal{Y}$. We will write $D(x, y)$ for the joint probability and 
$D(y|x)$ for the conditional probability of $y$ given $x$. Since the labels $y$ are finite and 
discrete we will also use the notation $D_y(x)$ for the conditional probability to emphasise 
that distributions over $\mathcal{Y}$ can be thought of as vectors in $\mathbb{R}^k$ for $k = |\mathcal{Y}|$. We will 
use $q$ to denote distributions over $\mathcal{Y}$ when the observations $x \in \mathcal{X}$ are irrelevant. 

When the number of possible labels $k = |\mathcal{Y}| > 2$ we call the classification problem 
a multiclass classification problem. A special case of this type of problem is structured 
prediction where the set of labels $\mathcal{Y}$ has some combinatorial structure that typically 
means $k$ is very large.$^1$ As seen in the experimental section below, a variety of 
problems, such as text tagging, can be construed as structured prediction problems. 

Given $m$ training observations $S = \{(x_i, y_i)\}_{i=1}^m$ drawn i.i.d. from $D$, the aim of 
the learner is to produce a predictor $h : \mathcal{X} \to \mathcal{Y}$ that minimises the misclassification 
error $e_D(h) = P_D[h(x) \neq y]$. Since the true distribution is unknown, an approximate 
solution to this problem is typically found by minimising a regularised empirical estimate 
of the risk for a surrogate loss $\ell$. Examples of surrogate losses will be discussed 
below. 

$^1$ In structured prediction, each output $y$ involves relationships among 'sub-components' of $y$. For example, 
the label of a pixel in an image depends on the label of neighbouring pixels. That's where the term 
'structured' comes from. However, different $y$'s are typically not assumed to possess any joint structure (i.e., 
it is typically assumed that the data is drawn from $\mathcal{X} \times \mathcal{Y}$). This is why structured prediction is no different 
in essence than multiclass classification. 

Once a loss is specified, a solution is found by solving 

$$\min_f \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i) + \Omega(f) \qquad (1)$$

where each model $f : \mathcal{X} \to \mathbb{R}^k$ assigns a vector of scores $f(x)$ to each observation 
and the regulariser $\Omega(f)$ penalises overly complex functions. A model $f$ found in this 
way can be transformed into a predictor by defining $h_f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} f_y(x)$. We 
will overload the definition of misclassification error and sometimes write $e_D(f)$ as 
shorthand for $e_D(h_f)$. 

In structured prediction, the models are usually specified in terms of a parameter 
vector $w \in \mathbb{R}^n$ and a feature map $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^n$ by defining $f_y(x; w) = \langle w, \phi(x, y) \rangle$ 
and in this case the regulariser is $\Omega(f) = \frac{\lambda}{2} \|w\|^2$ for some choice of $\lambda \in \mathbb{R}$. This is the 
framework used to implement the SVMs and CRFs used in the experiments described 
in Section 4. Although much of our analysis does not assume any particular parametric 
model, we explicitly discuss the implications of doing so in §3.2. 

A common surrogate loss for multiclass problems is a generalisation of the binary 
class hinge loss used for Support Vector Machines [5]: 

$$\ell_H(f, y) = [1 - M(f, y)]_+ \qquad (2)$$

where $[z]_+ = z$ for $z > 0$ and is $0$ otherwise, and $M(f, y) = f_y - \max_{y' \neq y} f_{y'}$ is the 
margin for the vector $f \in \mathbb{R}^k$. Intuitively, the hinge loss is minimised by models that 
not only classify observations correctly but also maximise the difference between the 
highest and second highest scores assigned to the labels. 
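As a minimal sketch of Equation (2), the margin and hinge loss can be computed directly from a score vector. The function name `multiclass_hinge` and the use of 0-based label indices are illustrative assumptions, not part of the paper.

```python
def multiclass_hinge(scores, y):
    """Margin-based multiclass hinge loss [1 - M(f, y)]_+ for a score vector.

    `scores` plays the role of f (one score per label) and `y` is the
    0-based index of the true label (an illustrative convention).
    """
    # Margin M(f, y): true-label score minus the best competing score.
    best_other = max(s for j, s in enumerate(scores) if j != y)
    margin = scores[y] - best_other
    # [z]_+ = z for z > 0 and 0 otherwise.
    return max(0.0, 1.0 - margin)


if __name__ == "__main__":
    print(multiclass_hinge([2.0, 0.5, -1.0], 0))  # margin of 1.5
    print(multiclass_hinge([2.0, 1.5, -1.0], 0))  # margin of 0.5
```

Only the single highest competing score is needed, which is why this loss can be evaluated with a max-style search rather than a sum over all labels.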

While there are other, consistent losses for SVMs [13, 8], these cannot scale up to 
structured estimation due to computational issues. For example, the multiclass hinge 
loss $\sum_{j \neq y} [1 + f_j(x)]_+$ is shown to be consistent in [8]. However, it requires evaluating 
$f$ on all possible labels except the true $y$. This is intractable for structured estimation 
where the number of possible labels grows exponentially with the size of the structured output. 
Since the other known and consistent multiclass hinge losses are similarly intractable 
we will only focus on the margin-based loss $\ell_H$ which can be evaluated quickly using 
techniques from dynamic programming, linear programming etc. [14, 12, 1]. 



2.1 Probabilistic Models and Losses 



The scores given to labels by a general model $f : \mathcal{X} \to \mathbb{R}^k$ can be transformed into a 
conditional probability distribution $p(x; f) \in [0, 1]^k$ by letting 

$$p_y(x; f) = \frac{\exp(f_y(x))}{\sum_{y' \in \mathcal{Y}} \exp(f_{y'}(x))}. \qquad (3)$$

It is easy to show that under this interpretation the hinge loss for a probabilistic 
model $p = p(\cdot; f)$ is given by 

$$\ell_H(p, y) = \left[ 1 - \ln \frac{p_y}{\max_{y' \neq y} p_{y'}} \right]_+ .$$

Another well known loss for probabilistic models, such as CRFs, is the log loss 

$$\ell_L(p, y) = -\ln p_y .$$

This loss penalises models that assign low probability to likely labels and, 
implicitly, that assign high probability to unlikely labels. 

We now propose a novel hybrid loss for probabilistic models that is a convex combination 
of the hinge and log losses 

$$\ell_\alpha(p, y) = \alpha \ell_L(p, y) + (1 - \alpha) \ell_H(p, y) \qquad (4)$$

where the mixture of the two losses is controlled by a parameter $\alpha \in [0, 1]$. Setting $\alpha = 1$ 
or $\alpha = 0$ recovers the log loss or hinge loss, respectively. The intention is that choosing 
$\alpha$ close to $0$ will emphasise having the maximum gap between the largest and second 
largest label probabilities while an $\alpha$ close to 1 will force models to prefer accurate 
probability assessments over strong classification. 
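The following is a small sketch of Equations (3) and (4), assuming scores are given as a plain Python list and labels are 0-based indices. The function names are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math


def softmax(scores):
    """Equation (3): turn a score vector into a probability vector."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def log_loss(p, y):
    """Log loss -ln p_y."""
    return -math.log(p[y])


def hinge_loss(p, y):
    """Hinge loss on probabilities: [1 - ln(p_y / max_{y' != y} p_{y'})]_+."""
    best_other = max(q for j, q in enumerate(p) if j != y)
    return max(0.0, 1.0 - math.log(p[y] / best_other))


def hybrid_loss(p, y, alpha):
    """Equation (4): convex combination of the log and hinge losses."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * log_loss(p, y) + (1.0 - alpha) * hinge_loss(p, y)


if __name__ == "__main__":
    p = softmax([2.0, 1.5, -1.0])
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, hybrid_loss(p, 0, alpha))
```

Because the softmax normaliser cancels inside the log ratio, `hinge_loss(softmax(f), y)` agrees with the score-based hinge loss of Equation (2) applied to `f`.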

3 Fisher Consistency For Classification 

A desirable property for a loss is that, given enough data, the models obtained by 
minimising the loss at each observation will make predictions that are consistent with 
the true label probabilities at each observation. 

Formally, we say a vector $f \in \mathbb{R}^{|\mathcal{Y}|}$ is aligned with a distribution $q$ over $\mathcal{Y}$ whenever 
maximisers of $f$ are also maximisers for $q$. That is, when $\operatorname{argmax}_{y \in \mathcal{Y}} f_y \subseteq 
\operatorname{argmax}_{y \in \mathcal{Y}} q_y$. If, for all label distributions $q$, minimising the conditional risk $L_q(f) = 
\mathbb{E}_{y \sim q}[\ell(f, y)]$ for a loss $\ell$ yields a vector $f^*$ aligned with $q$ we will say $\ell$ is Fisher 
consistent for classification (FCC)$^2$, or classification calibrated [13]. This is an important 
property for losses since it is equivalent to the asymptotic consistency of the 
empirical risk minimiser for that loss [?, Theorem 2]. 

The standard multiclass hinge loss $\ell_H$ is known to be inconsistent for classification 
when there are more than two classes [8, 13]. The analysis in [8] shows that the hinge 
loss is inconsistent whenever there is an instance $x$ with a non-dominant distribution - 
that is, $D_y(x) < \frac{1}{2}$ for all $y \in \mathcal{Y}$. Conversely, a distribution is dominant for an instance 
$x$ if there is some $y$ with $D_y(x) > \frac{1}{2}$. In contrast, the log loss used to train non-parametric 
CRFs is Fisher consistent for probability estimation - that is, the associated 
risk is minimised by the true conditional distribution - and thus $\ell_L$ is FCC since the 
minimising distribution is equal to $D(x)$ and thus aligned with $D(x)$. 
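The effect of dominance on the hinge loss can be illustrated numerically. The sketch below evaluates the conditional hinge risk at two hand-picked score vectors: one where all labels tie, and one that singles out the most likely label. The distributions and score vectors are illustrative choices, and the comparison is an illustration of the phenomenon rather than a proof.

```python
def conditional_hinge_risk(q, f):
    """Expected margin-based hinge loss of score vector f under label distribution q."""
    risk = 0.0
    for y, q_y in enumerate(q):
        best_other = max(s for j, s in enumerate(f) if j != y)
        risk += q_y * max(0.0, 1.0 - (f[y] - best_other))
    return risk


tied = [0.0, 0.0, 0.0]      # every label is a maximiser, so f is not aligned with q
favour_0 = [1.0, 0.0, 0.0]  # scores that single out label 0, the most likely label

non_dominant = [0.4, 0.3, 0.3]  # no label has probability above 1/2
dominant = [0.6, 0.2, 0.2]      # label 0 has probability above 1/2

if __name__ == "__main__":
    for name, q in (("non-dominant", non_dominant), ("dominant", dominant)):
        print(name,
              conditional_hinge_risk(q, tied),
              conditional_hinge_risk(q, favour_0))
```

For the non-dominant distribution the tied vector has the lower risk of the two, whereas for the dominant distribution the vector favouring the most likely label does.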

3.1 Conditional Consistency of the Hybrid Loss 

In order to analyse the consistency of the hybrid loss we introduce a more refined notion 
of Fisher consistency that takes into account the true distribution of class labels. If 
$q = (q_1, \ldots, q_k)$ is a distribution over the labels $\mathcal{Y}$ then we say the loss $\ell$ is conditionally 
FCC with respect to $q$ whenever minimising the conditional risk w.r.t. $q$, $L_q(f) = 
\mathbb{E}_{y \sim q}[\ell(f, y)]$, yields a predictor $f^*$ that is consistent with $q$. Of course, if a loss $\ell$ is 
conditionally FCC w.r.t. $q$ for all $q$ it is, by definition, (unconditionally) FCC. 

$^2$ Note that Fisher consistency for classification is weaker than Fisher consistency for density estimation. 
The former requires only that the predictions agree, while the latter requires that the estimated density is the same 
as the true data distribution. In this paper, we focus on the former only. 

Theorem 1 Let $q = (q_1, \ldots, q_k)$ be a distribution over labels and let $y_1 = \operatorname{argmax}_y q_y$ 
and $y_2 = \operatorname{argmax}_{y \neq y_1} q_y$ be the two most likely labels. Then the hybrid loss $\ell_\alpha$ is conditionally 
FCC for $q$ whenever $q_{y_1} > \frac{1}{2}$ or 

$$\alpha > 1 - \frac{q_{y_1} - q_{y_2}}{1 - 2 q_{y_1}}. \qquad (5)$$

For the proof see Appendix A. Theorem 1 can be inverted and interpreted as a constraint 
on the conditional distributions of some data distribution $D$ such that a hybrid loss 
with parameter $\alpha$ will yield consistent predictions. Specifically, the hybrid loss will be 
consistent if, for all $x \in \mathcal{X}$ such that $q = D(x)$ has no dominant label (i.e., $D_y(x) < \frac{1}{2}$ 
for all $y \in \mathcal{Y}$), the gap $D_{y_1}(x) - D_{y_2}(x)$ between the top two probabilities is larger 
than $(1 - \alpha)(1 - 2 D_{y_1}(x))$. When this is not the case for some $x$, the classification 
problem for that instance is, in some sense, too difficult to disambiguate. In this sense, 
the bound can be seen as a property on distributions akin to Tsybakov's noise condition 
[?]. Making this analogy precise is the focus of ongoing work. 
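The sufficient condition of Theorem 1 is easy to check for a given label distribution. The sketch below is one way to do so; the function name is hypothetical, and the boundary case where the top probability equals exactly 1/2 is treated conservatively as not covered, since the theorem as stated does not address it.

```python
def satisfies_theorem1_condition(q, alpha):
    """Return True if (q, alpha) meets the sufficient condition of Theorem 1.

    The condition is sufficient only: a False result does not show that the
    hybrid loss is inconsistent for q.
    """
    top, second = sorted(q, reverse=True)[:2]
    if top > 0.5:
        return True  # a dominant label: no constraint on alpha
    denom = 1.0 - 2.0 * top
    if denom <= 0.0:
        return False  # top == 1/2 exactly: not covered by the stated condition
    # Condition (5): alpha > 1 - (q_{y1} - q_{y2}) / (1 - 2 q_{y1}).
    return alpha > 1.0 - (top - second) / denom


if __name__ == "__main__":
    print(satisfies_theorem1_condition([0.46, 0.27, 0.27], 0.5))
    print(satisfies_theorem1_condition([0.40, 0.39, 0.21], 0.5))
    print(satisfies_theorem1_condition([0.40, 0.39, 0.21], 0.96))
```

The second distribution has a small gap between its top two labels, so a value of alpha close to 1 is needed before the condition holds.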



3.2 Parametric Consistency 

Since Fisher consistency is defined point-wise on observations, it is not directly applicable 
to parametric models as these enforce inter-observational constraints (e.g. 
smoothness). Abstractly, assuming parametric hypotheses can be seen as a restriction 
over the space of allowable scoring functions. When learning parametric models, risks 
are minimised over some subset $\mathcal{F}$ of functions from $\mathcal{X} \to \mathbb{R}^k$ instead of all possible 
functions. We now show that, given some weak assumptions on the hypothesis class 
$\mathcal{F}$, a loss being FCC is a necessary condition if the loss is also to be $\mathcal{F}$-consistent. 

We say a loss $\ell$ is $\mathcal{F}$-consistent if, for any distribution, minimising its associated 
risk over $\mathcal{F}$ yields a hypothesis with minimal 0-1 loss in $\mathcal{F}$.$^3$ Recall that the 
risk of a hypothesis $f \in \mathcal{F}$ associated with a loss $\ell$ and distribution $D$ over $\mathcal{X} \times \mathcal{Y}$ 
is $L_D(f) = \mathbb{E}_D[\ell(f(x), y)]$ and its 0-1 risk or misclassification error is 
$e_D(f) = P_D[y \neq \operatorname{argmax}_{y' \in \mathcal{Y}} f_{y'}(x)]$. Formally then, given a function class $\mathcal{F}$ we say $\ell$ is 
$\mathcal{F}$-consistent if, for all distributions $D$, 

$$L_D(f^*) = \inf_{f \in \mathcal{F}} L_D(f) \implies e_D(f^*) = \inf_{f \in \mathcal{F}} e_D(f). \qquad (6)$$

We need a relatively weak condition on function classes $\mathcal{F}$ to state our theorem. 
We say a class $\mathcal{F}$ is regular if the following two properties hold: 1) For any $g \in \mathbb{R}^k$ there 
exists an $x \in \mathcal{X}$ and an $f \in \mathcal{F}$ so that $f(x) = g$; and 2) For any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ 
there exists an $f \in \mathcal{F}$ so that $y = \operatorname{argmax}_{y' \in \mathcal{Y}} f_{y'}(x)$. Intuitively, the first condition 
says that for any distribution over labels there must be a function in the class which 
models it perfectly on some point in the input space. The second condition requires 
that any mode can be modelled on any input. Importantly, these properties are fairly 
weak in that they do not say anything about the constraints a function class might put 
on relationships between distributions modelled on different inputs. 

$^3$ While this is simpler and stronger than the usual asymptotic notion of consistency [?] it most readily 
relates to FCC and suffices for our discussion since we are only establishing that FCC is a necessary 
condition. 

Theorem 2 For regular function classes $\mathcal{F}$ any loss that is $\mathcal{F}$-consistent is necessarily 
also Fisher Consistent for Classification (FCC). 

The full proof is in Appendix B. The argument sketch is: since $\mathcal{F}$-consistency requires 
(6) to hold for all $D$ it must hold for a $D$ with all of its mass on a single observation 
$x_0$. If $\ell$ is not FCC there must be some label distribution $q$ and vector $g$ so that $L_q(g)$ is 
minimal but $e_q(g)$ is not. Choosing $x_0$ so that $f(x_0) = g$ (by the regularity of $\mathcal{F}$) and 
setting $D(y|x_0) = q$ gives a contradiction. 



3.3 Generalisation Bound 

We now give a PAC-Bayesian bound [10] for the generalisation error of the hybrid 
model that can be specialised to recover a bound for the multiclass hinge loss. A 
similar, alternative bound for the hybrid loss and an extended proof are available in 
Appendix C. 

Theorem 3 (Generalisation Margin Bound) For any data distribution $D$, for any 
prior $P$ over $w$, for any $w$, any $\delta \in (0, 1]$, any $\gamma > 0$ and any $\alpha \in [0, 1)$, 
with probability at least $1 - \delta$ over random samples $S$ from $D$ with $m$ instances, there 
exists a constant $c$, such that 



2 C (i-a)T» InWD+mm 



en < P( I)9) . S (E Q (M( W ', y)) <j) + 



]n8- 



Proof [sketch] By choosing the weight prior P(w) = |exp(-^L) and the pos- 
terior Q{w') = ^ exp(— ^ w ), one can show er> — Pd(Eq M(w',y) < 0) by 
symmetry argument proposed in [?, 9|. Applying the PAC-Bayes margin bound [?, ?] 
and knowing the margin threshold 7' < c(l - a) 7 and KL(Q||P) = yields the 
theorem. ■ 



Setting $\alpha = 0$ in the above bound recovers a margin bound for SVMs (see [?] for 
averaging classifiers of SVMs, and [?] for the structured case). Unfortunately, one cannot 
set $\alpha = 1$ to achieve a PAC-Bayes bound for a pure log loss classifier in this manner 
due to the $(1 - \alpha)$ dependence. However, we are not aware of 
any PAC-Bayes bound on the generalisation error for log loss. 



4 Experiments 

The analysis of the hybrid loss suggests it should 
be able to outperform the hinge loss due to its 
improved consistency on distributions with non- 
dominant labels. Furthermore, it should also 
make more efficient use of data than log loss on 
distributions with dominant labels. These hy- 
potheses were confirmed by applying the hybrid, 
log and hinge losses to a number of synthetic mul- 
ticlass data sets in which the data set size and pro- 
portion of examples with non-dominant labels are 
carefully controlled. 

We also compared the hybrid loss with the log 
and hinge losses on several real structured estima- 
tion problems and observed that the hybrid loss 
regularly outperforms the other losses and consistently performs at least as well as the 
better of the log and hinge losses on any problem. 

4.1 Multiclass Classification 

Two types of multiclass simulations were performed. The first examined the performance 
of the hybrid, log and hinge losses when no observations had a dominant label. 
That is, all observations were drawn from a $D$ with $D_y(x) < 1/2$ for all labels $y$. The 
second experiment considered distributions with a controlled mixture of observations 
with dominant and non-dominant labels. 

Non-dominant Distributions To make the experiment as simple as possible, we considered 
an observation space of size $|\mathcal{X}| = 1$ and focused on varying the number of 
labels and their probabilities. The label set $\mathcal{Y}$ took the sizes $|\mathcal{Y}| = 3, 4, 5, \ldots, 10$. 
One label $y^* \in \mathcal{Y}$ was assigned probability $D_{y^*}(x) = 0.46$ and the remaining labels were 
given equal shares of the remaining 0.54 (e.g., in the 3 class case the other labels each have 
probability 0.27, and in the 10 class case, 0.06). Note that this means that for all the label 
set sizes, the gap $D_{y^*}(x) - D_y(x)$ is at least 0.19, which is always greater than 
$(1 - \alpha)(1 - 2 D_{y^*}(x)) = 0.04$ (for $\alpha = 0.5$, the value used in Figure 1), so the hybrid consistency condition (5) is always met. 
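The arithmetic in this paragraph can be checked with a few lines of code. The sketch below assumes $\alpha = 0.5$ (the value reported in the caption of Figure 1); the function name is illustrative.

```python
def gap_and_bound(k, p_star=0.46, alpha=0.5):
    """Gap between the top label and the others, and the bound it must exceed.

    k is the number of labels; the remaining 1 - p_star is split equally
    among the other k - 1 labels, as in the experiment described above.
    """
    p_other = (1.0 - p_star) / (k - 1)
    gap = p_star - p_other
    bound = (1.0 - alpha) * (1.0 - 2.0 * p_star)
    return gap, bound


if __name__ == "__main__":
    for k in range(3, 11):
        gap, bound = gap_and_bound(k)
        print(k, round(gap, 4), round(bound, 4), gap > bound)
```

The smallest gap occurs for three labels, so the condition holds for every label set size considered.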

Features were a constant value in $\mathbb{R}^2$ as were the parameter vectors $w_y \in \mathbb{R}^2$ for 
$y \in \mathcal{Y}$. Models were found using LBFGS [3]. The resulting training errors for hinge, 
log and hybrid losses are plotted in Figure 1 as a function of the number of labels. As 
we can clearly see, the hinge loss error increases as the number of classes increases, 
whereas the errors for the log and the hybrid losses remain constant at $1 - D_{y^*}(x)$, in 
concordance with the consistency analysis. 

Mix of Non-dominant and Dominant Distributions The second synthetic experiment 
examined how the three losses performed given various training set sizes (denoted 
by $m$) and various proportions of instances with non-dominant distributions (denoted 
by $p$). 



[Figure 1: plot against the number of classes; legend: SVM; CRF and Hybrid.]

Figure 1: Training error for the hybrid ($\alpha = 0.5$), log and hinge losses vs. number of 
classes in the non-dominant case. 

[Figure 2: three scatter plots with horizontal axes Hinge, Log and Log respectively.]

(a) Hybrid vs. Hinge (31/15) (b) Hybrid vs. Log (34/15) (c) Hinge vs. Log (30/23) 

Figure 2: Performance of the hybrid, hinge, and log losses on non-dominant/dominant 
mixtures. Points denote pairs of test accuracies for models trained on one of 60 data 
sets using the losses named on the axes. Score (a/b) denotes the vertical loss with a 
wins and b losses (ties not counted). 



We generated 60 different data sets, all with $\mathcal{Y} = \{1, 2, 3, 4, 5\}$, in the following 
manner: Instances came from either a non-dominant class distribution or a dominant 
class distribution. In the non-dominant class case, $x \in \mathbb{R}^{100}$ is set to a predefined, constant, 
non-zero vector and its label distribution is $D_1(x) = 0.4$ and $D_y(x) = 0.15$ for 
$y > 1$. In the dominant case, each dimension $x_i$ was drawn from a normal distribution 
$N(\mu = 1 + y, \sigma = 0.6)$ depending on the class $y = 1, \ldots, 5$. The proportion $p$ ranged 
over 10 values $p = 0.1, 0.2, 0.3, \ldots, 1$ and for each $p$, test and validation sets of size 
1000 were generated. Training set sizes of $m = 30, 60, 100, 300, 600, 1000$ were used 
for each $p$ value for a total of 60 training sets. The optimal regularisation parameter 
$\lambda$ and hybrid loss parameter $\alpha$ were selected using the validation set for each loss on 
each training set. Then models with parameters $w_y \in \mathbb{R}^{100}$ for $y \in \mathcal{Y}$ were found using 
LBFGS [3] for each of the three losses on each of the 60 training sets and then assessed 
using the test set. 
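A data generator along these lines might look as follows. This is only a sketch under several assumptions that the text does not specify: the constant vector is taken to be all ones, each instance is independently non-dominant with probability $p$ (rather than an exact proportion), and classes in the dominant case are drawn uniformly. All function names are hypothetical.

```python
import random

LABELS = [1, 2, 3, 4, 5]
NON_DOMINANT_WEIGHTS = [0.4, 0.15, 0.15, 0.15, 0.15]  # D_1 = 0.4, D_y = 0.15 for y > 1
DIM = 100


def sample_instance(p, rng):
    """Draw one (x, y) pair; p is the chance of a non-dominant instance."""
    if rng.random() < p:
        # Non-dominant case: a fixed non-zero vector (all ones is an assumption)
        # whose label is drawn from the non-dominant distribution.
        x = [1.0] * DIM
        y = rng.choices(LABELS, weights=NON_DOMINANT_WEIGHTS)[0]
    else:
        # Dominant case: pick a class (uniformly, by assumption), then draw
        # every dimension from N(mu = 1 + y, sigma = 0.6).
        y = rng.choice(LABELS)
        x = [rng.gauss(1.0 + y, 0.6) for _ in range(DIM)]
    return x, y


def make_dataset(m, p, seed=0):
    """Generate m instances with non-dominant proportion (approximately) p."""
    rng = random.Random(seed)
    return [sample_instance(p, rng) for _ in range(m)]


if __name__ == "__main__":
    data = make_dataset(m=30, p=0.5)
    print(len(data), len(data[0][0]), data[0][1])
```

Looping `make_dataset` over the ten values of `p` and six values of `m` would give the 60 training sets described above.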

The results are summarised in Figure 2. Each point shows the test accuracy for a 
pair of losses. The predominance of points above the diagonal lines in (a) and (b) shows 
that the hybrid loss outperforms the hinge loss and the log loss in most of the data sets, 
while the log and hinge losses perform competitively against each other. 

4.2 Structured Estimation 

Unlike the general multiclass case, structured estimation problems have a higher chance 
of non-dominant distributions because of the very large number of labels as well as ties 
or ambiguity regarding those labels. For example, in text chunking, changing the tag of 
one phrase while leaving the rest unchanged should not drastically change the probability 
predictions - especially when there are ambiguities. Because of the prevalence 
of non-dominant distributions, we expect models trained using a hinge loss to 
perform poorly on these problems relative to those trained with hybrid or log losses. 

CONLL2000 Text Chunking Our first structured estimation experiment is carried 
out on the CONLL2000 text chunking task [4]. The data set has 8936 training sentences 
and 2012 testing sentences with 106978 and 23852 phrases (a.k.a. chunks) respectively. 

[Figure 3: two plots against the number of sentences; (a) the testing set, (b) the training set.]

Figure 3: Estimated probabilities of the true label $D_{y_i}(x_i)$ and most likely label 
$D_{y^*}(x_i)$. Sentences are sorted according to $D_{y_i}(x_i)$ and $D_{y^*}(x_i)$ respectively in ascending 
order. $D = 1/2$ is shown as the straight dotted black line. About 700 sentences 
out of 2012 in the testing set and 2000 sentences out of 8936 in the training set have no 
dominant class. 



Table 1: Accuracy, precision, recall and F1 Score on the CONLL2000 text chunking 
task. 

Train Portion   Loss     Accuracy   Precision   Recall   F1 Score
0.1             Hinge    91.14      85.31       85.52    85.41
0.1             Log      92.05      87.04       87.01    87.02
0.1             Hybrid   92.07      87.17       86.93    87.05
1               Hinge    94.61      91.23       91.37    91.30
1               Log      95.10      92.32       91.97    92.15
1               Hybrid   95.11      92.35       92.00    92.17


The task is to divide a text into syntactically correlated parts of words such as noun 
phrases, verb phrases, and so on. For a sentence with $L$ chunks, its label consists 
of the tagging sequence of all its chunks, i.e. $y = (y^1, y^2, \ldots, y^L)$, where $y^i$ is the 
chunking tag for chunk $i$. As is common in this task, the label $y$ is modelled 
as a 1D Markov chain to account for the dependency between adjacent chunking tags 
$(y^i, y^{i+1})$ given the observation $x$. Clearly, the model has exponentially many possible 
labels, which suggests there are many non-dominant classes. 

Since the true underlying distribution is unknown, we train a CRF$^4$ on the training 
set and then apply the trained model to both testing and training datasets to get 
an estimate of the conditional distributions for each instance. We sort the sentences 
$x_i$ from highest to lowest estimated probability on the true chunking label $y_i$ given 
$x_i$. The result is plotted in Figure 3, from which we observe the existence of many 
non-dominant distributions - about 1/3 of the testing sentences and about 1/4 of the 
training sentences. 

We split the data into 3 parts: training (20%), testing (40%) and validation (40%). 

$^4$ Using the feature template from the CRF++ toolkit [6], and the CRF code from Leon Bottou [2]. 

Table 2: Accuracy, precision, recall and F1 Score on the baseNP chunking task for 
training on increasing portions of training set. 

Train Portion   Loss     Accuracy   Precision   Recall   F1 Score
0.1             Hinge    88.48      71.70       75.96    73.77
0.1             Log      90.86      81.09       78.96    80.01
0.1             Hybrid   90.90      81.23       79.09    80.15
1               Hinge    94.64      87.58       88.30    87.94
1               Log      95.21      90.07       88.89    89.48
1               Hybrid   95.24      90.12       88.98    89.55


Table 3: Accuracy, precision, recall and F1 Score on the Japanese named entity recognition 
task. 

Loss     Accuracy   Precision   Recall   F1 Score
Hinge    95.63      73.24       64.37    68.52
Log      95.92      78.22       64.85    70.91
Hybrid   95.95      79.02       65.32    71.52


The regularisation parameter $\lambda$ and the weight $\alpha$ were determined via parameter selection 
using the validation set. To see the performance with different training sizes, we 
took part of the training data to learn the model and gathered statistics on the test set. 
The accuracy, precision, recall and F1 Score on the test set are reported in Table 1 when 
using 10% and 100% of the training set. The hybrid loss outperforms both the hinge 
loss and the log loss (albeit marginally). 
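For reference, the F1 Score in the tables is the harmonic mean of precision and recall. The short sketch below uses the standard definition (the paper does not spell it out) and checks it against the hinge-loss row of Table 1.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hinge loss, 10% of the CONLL2000 training set (Table 1).
    print(round(f1_score(85.31, 85.52), 2))
```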

baseNP Chunking A similar methodology to the previous experiment is applied to 
the BaseNP data set [6]. It has 900 sentences in total and the task is to automatically 
classify whether a chunking phrase is a baseNP or not. We split the data into 3 parts: training 
(20%), testing (40%) and validation (40%). Once again, $\lambda$ and $\alpha$ are determined via 
model selection on the validation set. We report the test accuracy, precision, recall and 
F1 Score in Table 2 for training on increasing proportions of the training set. The hybrid 
outperforms the other two losses on all measures. 

Japanese named entity recognition Finally, we used a multiclass data set containing 
716 Japanese sentences and 17 annotated named entities [6]. The task is to locate and 
classify proper nouns and numerical information in a document into certain classes of 
named entities such as names of persons, organizations, and locations. We train all 3 
models on 216 sentences and test on 500 sentences with the default parameters found 
in Bottou's CRF code. The extra parameter $\alpha$ is selected to give the smallest test error. The 
result is reported in Table 3. Once again, the hybrid loss outperforms the other two 
losses. 

5 Conclusion and Discussion 



We have provided theoretical and empirical motivation for the use of a novel hybrid 
loss for multiclass and structured prediction problems which can be used in place of 
the more common log loss or multiclass hinge loss. This new loss attempts to blend the 
strength of purely discriminative approaches to classification, such as Support Vector 
Machines, with probabilistic approaches, such as Conditional Random Fields. Theoretically, 
the hybrid loss enjoys better consistency guarantees than the hinge loss while 
experimentally we have seen that the addition of a purely discriminative component 
can improve accuracy when data is less prevalent. 

5.1 Future Work 

Theoretically, we expect that some stronger sufficient conditions on $\alpha$ are possible 
since the bounds used to establish Theorem 1 are not tight. Our conjecture is that 
a necessary and sufficient condition would include a dependency on the number of 
classes. We are also investigating connections between $\alpha$ and the multiclass Tsybakov 
noise condition [?]. 

To our knowledge, the notion of a regular function class for the purposes of con- 
sistency analysis is a novel one. Characterisations of this property for various existing 
parametric models would make testing for regularity easier. 

One current limitation of the hybrid model is the use of a single, fixed $\alpha$ for all 
observations in a training set. One interesting avenue to explore would be trying to 
dynamically estimate a good value of $\alpha$ on a per-observation basis. This may further 
improve the efficacy of the hybrid loss by exploiting the robustness of SVMs (low $\alpha$) 
when the label distribution for an observation has a dominant class but switching to 
probability estimation via CRFs (high $\alpha$) when this is not the case. 

A Proof for Consistency 

Proof of Theorem 1. We use $L_\alpha(p, D) = \mathbb{E}_{y \sim D}[\ell_\alpha(p, y)]$ and $\Delta(\mathcal{Y})$ to denote distributions 
over $\mathcal{Y}$. Since we are free to permute labels within $\mathcal{Y}$ we will assume without loss of generality 
that $D_1 = \max_{y \in \mathcal{Y}} D_y$ and $D_2 = \max_{y \neq 1} D_y$. The proof now proceeds by contradiction and 
assumes there is some minimiser $p = \operatorname{argmin}_{q \in \Delta(\mathcal{Y})} L_\alpha(q, D)$ that is not aligned with $D$. That 
is, there is some $y^* \neq 1$ such that $p_{y^*} \geq p_1$. For simplicity, and again without loss of generality, 
we will assume $y^* = 2$. 

The first case to consider is when $p_2$ is a maximum and $p_1 < p_2$. Here we construct a $q$ that 
"flips" the values of $p_1$ and $p_2$ and leaves all the other values unchanged. That is, $q_1 = p_2$, $q_2 = p_1$ 
and $q_y = p_y$ for all $y = 3, \ldots, k$. Intuitively, this new point is closer to $D$ and therefore the 
CRF component of the loss will be reduced while the SVM loss won't increase. The difference 
in conditional risks satisfies 

$$L_\alpha(p, D) - L_\alpha(q, D) = \sum_{y=1}^{k} D_y \left( \ell_\alpha(p, y) - \ell_\alpha(q, y) \right)$$
$$= D_1 \left( \ell_\alpha(p, 1) - \ell_\alpha(q, 1) \right) + D_2 \left( \ell_\alpha(p, 2) - \ell_\alpha(q, 2) \right)$$
$$= (D_1 - D_2) \left( \ell_\alpha(q, 2) - \ell_\alpha(q, 1) \right)$$

since $\ell_\alpha(p, 1) = \ell_\alpha(q, 2)$ and $\ell_\alpha(p, 2) = \ell_\alpha(q, 1)$ and the other terms cancel by construction. 
As $D_1 - D_2 > 0$ by assumption, all that is required now is to show that $\ell_\alpha(q, 2) - \ell_\alpha(q, 1) = 
\alpha \ln \frac{q_1}{q_2} + (1 - \alpha)(\ell_H(q, 2) - \ell_H(q, 1))$ is strictly positive. 

Since $q_1 > q_y$ for $y \neq 1$ we have $\ln \frac{q_1}{q_2} > 0$, $\ell_H(q, 2) = \left[ 1 - \ln \frac{q_2}{q_1} \right]_+ > 1$, and 
$\ell_H(q, 1) = \left[ 1 - \ln \frac{q_1}{\max_{y \neq 1} q_y} \right]_+ < 1$, and so $\ell_H(q, 2) - \ell_H(q, 1) > 1 - 1 = 0$. Thus, 
$\ell_\alpha(q, 2) - \ell_\alpha(q, 1) > 0$ as required. 

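Now suppose that $p_2 = p_1$ is a maximum. In this case we show that a slight perturbation 
$q = (p_1 + \epsilon, p_2 - \epsilon, p_3, \ldots, p_k)$ yields a lower risk for small $\epsilon > 0$. For $y \neq 1, 2$ we have 
$\ell_L(p, y) - \ell_L(q, y) = 0$, and since $p_2 \geq p_y$ and $q_1 > q_y$, 

$$\ell_H(p, y) - \ell_H(q, y) = \left( 1 - \ln \frac{p_y}{p_1} \right) - \left( 1 - \ln \frac{q_y}{q_1} \right) = \ln \frac{p_1}{q_1} \geq 1 - \frac{q_1}{p_1} = -\frac{\epsilon}{p_1}$$

since $-\ln x \geq 1 - x$ for $x > 0$ and $q_1 = p_1 + \epsilon = p_2 + \epsilon$. Therefore 

$$\ell_\alpha(p, y) - \ell_\alpha(q, y) \geq -\epsilon \frac{1 - \alpha}{p_1}. \qquad (7)$$

When $y = 1$, $\ell_L(p, 1) - \ell_L(q, 1) = -\ln \frac{p_1}{q_1} \geq 1 - \frac{p_1}{q_1} = \frac{\epsilon}{p_1 + \epsilon}$ and 
$\ell_H(p, 1) - \ell_H(q, 1) = \left( 1 - \ln \frac{p_1}{p_2} \right) - \left( 1 - \ln \frac{q_1}{q_2} \right) = \ln \frac{q_1}{q_2} = \ln \frac{p_1 + \epsilon}{p_1 - \epsilon}$ since $p_1 = p_2$. 
Thus $\ell_H(p, 1) - \ell_H(q, 1) \geq 1 - \frac{p_1 - \epsilon}{p_1 + \epsilon} = \frac{2\epsilon}{p_1 + \epsilon}$. And so 

$$\ell_\alpha(p, 1) - \ell_\alpha(q, 1) \geq \epsilon \left[ \frac{\alpha}{p_1 + \epsilon} + \frac{2(1 - \alpha)}{p_1 + \epsilon} \right]. \qquad (8)$$

Finally, when $y = 2$ we have $\ell_L(p, 2) - \ell_L(q, 2) = -\ln \frac{p_2}{q_2} \geq 1 - \frac{p_2}{q_2} = -\frac{\epsilon}{p_1 - \epsilon}$ and 
$\ell_H(p, 2) - \ell_H(q, 2) = \left( 1 - \ln \frac{p_2}{p_1} \right) - \left( 1 - \ln \frac{q_2}{q_1} \right) = \ln \frac{q_2}{q_1} \geq 1 - \frac{q_1}{q_2} = -\frac{2\epsilon}{p_1 - \epsilon}$. Thus, 

$$\ell_\alpha(p, 2) - \ell_\alpha(q, 2) \geq -\epsilon \left[ \frac{\alpha}{p_1 - \epsilon} + \frac{2(1 - \alpha)}{p_1 - \epsilon} \right].$$

Putting the inequalities (7), (8) and the bound for $y = 2$ together yields 

$$\lim_{\epsilon \to 0} \frac{L_\alpha(p, D) - L_\alpha(q, D)}{\epsilon} \geq \lim_{\epsilon \to 0} \left[ D_1 \frac{2 - \alpha}{p_1 + \epsilon} - D_2 \frac{2 - \alpha}{p_1 - \epsilon} \right] - \sum_{y=3}^{k} D_y \frac{1 - \alpha}{p_1}$$
$$= \frac{D_1 - D_2}{p_1} (2 - \alpha) - \frac{1 - D_1 - D_2}{p_1} (1 - \alpha) \qquad (9)$$
$$= \frac{1}{p_1} \left( D_1 - D_2 + (1 - \alpha)(2 D_1 - 1) \right).$$

Observing that since $D_1 > D_2$, when $D_1 > \frac{1}{2}$ the final term is positive without any constraint 
on $\alpha$ and when $D_1 < \frac{1}{2}$ the difference in risks is positive whenever 

$$\alpha > 1 - \frac{D_1 - D_2}{1 - 2 D_1} \qquad (10)$$

completes the proof. 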



B Proof of Necessity of FCC 

Proof of Theorem 2. The proof is by contradiction. We assume we have a regular function class $\mathcal{F}$ and a loss $\ell$ which is $\mathcal{F}$-consistent but not FCC. That is, (6) holds for $\ell$ but there exists a distribution $p$ over $\mathcal{Y}$ such that there is a $g \in \mathbb{R}^{\mathcal{Y}}$ which minimises the conditional risk $L_p(g)$ but $\operatorname{argmax}_{y \in \mathcal{Y}} g_y \neq \operatorname{argmax}_{y \in \mathcal{Y}} p_y$.

By the assumption of the regularity of $\mathcal{F}$ there is an $x \in \mathcal{X}$ and an $f \in \mathcal{F}$ so that $f(x) = g$. We now define a distribution $D$ over $\mathcal{X} \times \mathcal{Y}$ that puts all its mass on the set $\{x\} \times \mathcal{Y}$ so that $D(x,y) = p_y$. Since this distribution is concentrated on a single $x$, its full risk and conditional risk on $x$ are the same. That is, $L_D(\cdot) = L_p(\cdot)$. Thus,

\[ L_D(f) = L_p(g) = \inf_{g'} L_p(g') = \inf_{f' \in \mathcal{F}} L_D(f'). \]

By the assumption of $\mathcal{F}$-consistency, since $f$ is a minimiser of $L_D$ it must also minimise $e_D$. Once again, the construction of $D$ means that $e_D(f) = e_p(g) = P_{y \sim p}[y \neq \operatorname{argmax}_{y' \in \mathcal{Y}} g_{y'}] = 1 - p_{y_g}$, where $y_g = \operatorname{argmax}_y g_y$ is the label predicted by $g$. However,

\[ e_D(f) = e_p(g) = 1 - p_{y_g} > 1 - p_{y^*} \]

since $y^* = \operatorname{argmax}_y p_y \neq \operatorname{argmax}_y g_y = y_g$.

By the second regularity property, there must also be an $f^* \in \mathcal{F}$ such that $\operatorname{argmax}_y f^*_y(x) = y^*$, so that $e_D(f) > \inf_{f' \in \mathcal{F}} e_D(f') = e_D(f^*) = 1 - p_{y^*}$. Thus, we have shown that there exists a distribution $D$ such that $f \in \mathcal{F}$ is a minimiser of the risk $L_D$ but is not a minimiser of the misclassification rate $e_D$, which contradicts the assumption of the $\mathcal{F}$-consistency of $\ell$. Therefore, $\ell$ must be FCC.
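
To make the key inequality of the argument concrete, the following sketch computes the misclassification rate $1 - p_{y_g}$ of a score vector $g$ whose argmax disagrees with that of $p$, and compares it with the optimal rate $1 - p_{y^*}$. The numbers and the helper name `misclassification_rate` are illustrative assumptions, not part of the paper.

```python
def argmax(v):
    # Index of the largest entry (first one on ties).
    return max(range(len(v)), key=lambda i: v[i])

def misclassification_rate(p, g):
    # Probability that a label drawn from p differs from the label
    # predicted by the score vector g, i.e. 1 - p[argmax g].
    return 1.0 - p[argmax(g)]

# Hypothetical label distribution and two score vectors.
p = [0.5, 0.3, 0.2]        # y* = 0
g = [0.1, 0.9, 0.0]        # predicts label 1, which is not y*
g_star = [2.0, 1.0, 0.0]   # predicts y* = 0

if __name__ == "__main__":
    print(misclassification_rate(p, g))       # rate of the disagreeing g
    print(misclassification_rate(p, g_star))  # optimal rate 1 - p[y*]
```

Any $g$ whose argmax differs from $y^*$ has a strictly larger error whenever $p_{y^*}$ is the unique maximum, which is the gap the contradiction exploits.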



C Proof for PAC-Bayes Bounds 

To be explicit, we write $M$ and $p_y$ as $M(x, y; w)$ and $p(y|x; w)$ when they are parameterized by $w$.

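Theorem 4 (Generalisation Bound) For any data distribution $D$, for any prior $P$ over $w$, for any $\delta \in (0, 1]$ and $\alpha \in [0, 1)$, for any $\gamma > 0$ and for any $w$, with probability at least $1 - \delta$ over random samples $S$ from $D$ with $m$ instances, we have

\[ E_D\bigl[(\gamma - M(x,y;w))_+\bigr] \leq \frac{1}{m}\sum_{i=1}^{m} (\gamma - M(x_i,y_i;w))_+ + \frac{1}{1-\alpha}\left(\frac{\alpha}{\sqrt{m}} + \sqrt{\frac{\ln\frac{1}{P(w)} + \ln A(\alpha,w) + \ln\frac{1}{\delta(1-e^{-2})}}{2m}}\right), \]

where

\[ R(\alpha,w) = \alpha\,E_D[-\ln p(y|x;w)] + (1-\alpha)\,E_D\bigl[(\gamma - M(x,y;w))_+\bigr], \]

\[ R_S(\alpha,w) = \alpha\,\frac{\sum_{i=1}^{m} -\ln p(y_i|x_i;w)}{m} + (1-\alpha)\,\frac{\sum_{i=1}^{m} (\gamma - M(x_i,y_i;w))_+}{m}. \]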



Here $A$ is upper bounded independently of $D$. For example, for a zero-one loss, it is upper bounded by $m + 1$ (see [?]). The theorem gives a bound on the true margin error of the hybrid model. The theorem follows immediately from Theorem 6 in the appendix.
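
To get a feel for how the bound scales, the sketch below evaluates its complexity term, read here as $\frac{1}{1-\alpha}\bigl(\frac{\alpha}{\sqrt{m}} + \sqrt{(\ln\frac{1}{P(w)} + \ln A + \ln\frac{1}{\delta(1-e^{-2})})/(2m)}\bigr)$. This reading of the formula, the function name `bound_slack`, and the example inputs (a uniform prior over 1000 candidate parameter vectors, and $\ln A$ set to $\ln(m+1)$ purely for illustration) are assumptions rather than values from the paper.

```python
import math

def bound_slack(m, log_inv_prior, log_A, alpha, delta):
    """Complexity term added to the empirical margin error.

    m             -- number of training instances
    log_inv_prior -- ln(1 / P(w)), assumed non-negative
    log_A         -- ln A, assumed non-negative
    alpha         -- mixing weight of the log loss, in [0, 1)
    delta         -- confidence parameter, in (0, 1]
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    confidence = math.log(1.0 / (delta * (1.0 - math.exp(-2.0))))
    root = math.sqrt((log_inv_prior + log_A + confidence) / (2.0 * m))
    return (alpha / math.sqrt(m) + root) / (1.0 - alpha)

if __name__ == "__main__":
    for m in (1_000, 10_000, 100_000):
        slack = bound_slack(m, math.log(1000.0), math.log(m + 1), 0.5, 0.05)
        print(m, round(slack, 4))
```

Under this reading the term shrinks roughly as $1/\sqrt{m}$ and grows without bound as $\alpha \to 1$, which is why $\alpha$ is restricted to $[0, 1)$.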

Lemma 5 (PAC-Bayes bound [?, ?]) For any data distribution $D$, for any prior $P$ and posterior $Q$ over $w$, for any $\delta \in (0, 1]$ and for any loss $\ell$, with probability at least $1 - \delta$ over random sample $S$ from $D$ with $m$ instances, we have

\[ R(Q,\ell) \leq R_S(Q,\ell) + \sqrt{\frac{KL(Q\|P) + \ln\Bigl(\frac{1}{\delta}\,E_{S\sim D^m} E_{w\sim P}\, e^{2m(R(Q,\ell) - R_S(Q,\ell))^2}\Bigr)}{2m}}, \]

where $KL(Q\|P) := E_{w\sim Q} \ln\bigl(\frac{Q(w)}{P(w)}\bigr)$ is the Kullback-Leibler divergence between $Q$ and $P$, and $R(Q,\ell) = E_{Q,D}[\ell(x,y;w)]$, $R_S(Q,\ell) = E_Q\Bigl[\frac{\sum_{i=1}^{m} \ell(x_i,y_i;w)}{m}\Bigr]$.

Theorem 6 (Bound on Averaging classifier) For any data distribution $D$, for any prior $P$ and posterior $Q$ over $w$, for any $\delta \in (0, 1]$ and $\alpha \in [0, 1)$ and for any $\gamma > 0$, with probability at least $1 - \delta$ over random sample $S$ from $D$ with $m$ instances, we have

\[ E_{Q,D}\bigl[(\gamma - M(x,y;w))_+\bigr] \leq \frac{1}{m}\,E_Q\Bigl[\sum_{i=1}^{m} (\gamma - M(x_i,y_i;w))_+\Bigr] + \frac{\alpha}{(1-\alpha)\sqrt{m}} + \frac{1}{1-\alpha}\sqrt{\frac{KL(Q\|P) + \ln A(\alpha) + \ln\frac{1}{\delta(1-e^{-2})}}{2m}}, \]

where $KL(Q\|P) := E_{w\sim Q} \ln\bigl(\frac{Q(w)}{P(w)}\bigr)$ is the Kullback-Leibler divergence between $Q$ and $P$, and

\[ R(\alpha) = \alpha\,E_{Q,D}[-\ln p(y|x;w)] + (1-\alpha)\,E_{Q,D}\bigl[(\gamma - M(x,y;w))_+\bigr], \]

\[ R_S(\alpha) = E_Q\Bigl[\alpha\,\frac{\sum_{i=1}^{m} -\ln p(y_i|x_i;w)}{m} + (1-\alpha)\,\frac{\sum_{i=1}^{m} (\gamma - M(x_i,y_i;w))_+}{m}\Bigr], \]

\[ A(\alpha) = E_{S\sim D^m} E_{w\sim P}\, e^{2m(R(\alpha) - R_S(\alpha))^2}. \]

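Proof. Since $E_{S\sim D^m}\Bigl(E_Q\Bigl[\frac{\sum_{i=1}^{m} -\ln p(y_i|x_i;w)}{m}\Bigr]\Bigr) = E_{Q,D}[-\ln p(y|x;w)]$, by the Chernoff bound we have

\[ P_{S\sim D^m}\left(E_Q\Bigl[\frac{\sum_{i=1}^{m} -\ln p(y_i|x_i;w)}{m}\Bigr] - E_{Q,D}[-\ln p(y|x;w)] < \epsilon\right) \geq 1 - e^{-2m\epsilon^2}. \]

Define $B(S) := E_Q\Bigl[\frac{\sum_{i=1}^{m} -\ln p(y_i|x_i;w)}{m}\Bigr] - E_{Q,D}[-\ln p(y|x;w)]$, and write $\Delta := \sqrt{\frac{KL(Q\|P) + \ln\frac{1}{\delta} + \ln A(\alpha)}{2m}}$ for brevity. Applying Lemma 5 to $R(\alpha)$ and $R_S(\alpha)$, we have for any $P$, $Q$

\[
\begin{aligned}
\delta &\geq P_{S\sim D^m}\bigl(R(\alpha) > R_S(\alpha) + \Delta\bigr) \\
&\geq P_{S\sim D^m}\bigl(R(\alpha) > R_S(\alpha) + \Delta,\; B(S) < \epsilon\bigr) \\
&= P_{S\sim D^m}\Bigl((1-\alpha)\,E_{Q,D}\bigl[(\gamma - M(x,y;w))_+\bigr] > (1-\alpha)\,E_Q\Bigl[\frac{\sum_{i=1}^{m} (\gamma - M(x_i,y_i;w))_+}{m}\Bigr] + \alpha B(S) + \Delta,\; B(S) < \epsilon\Bigr) \\
&\geq P_{S\sim D^m}\Bigl((1-\alpha)\,E_{Q,D}\bigl[(\gamma - M(x,y;w))_+\bigr] > (1-\alpha)\,E_Q\Bigl[\frac{\sum_{i=1}^{m} (\gamma - M(x_i,y_i;w))_+}{m}\Bigr] + \alpha\epsilon + \Delta,\; B(S) < \epsilon\Bigr) \\
&= P_{S\sim D^m}\Bigl((1-\alpha)\,E_{Q,D}\bigl[(\gamma - M(x,y;w))_+\bigr] > (1-\alpha)\,E_Q\Bigl[\frac{\sum_{i=1}^{m} (\gamma - M(x_i,y_i;w))_+}{m}\Bigr] + \alpha\epsilon + \Delta \;\Big|\; B(S) < \epsilon\Bigr)\, P_{S\sim D^m}\bigl(B(S) < \epsilon\bigr).
\end{aligned}
\]

Dividing both sides by $P_{S\sim D^m}(B(S) < \epsilon)$, we get

\[ P_{S\sim D^m}\Bigl((1-\alpha)\,E_{Q,D}\bigl[(\gamma - M(x,y;w))_+\bigr] > (1-\alpha)\,E_Q\Bigl[\frac{\sum_{i=1}^{m} (\gamma - M(x_i,y_i;w))_+}{m}\Bigr] + \alpha\epsilon + \Delta \;\Big|\; B(S) < \epsilon\Bigr) \leq \frac{\delta}{P_{S\sim D^m}(B(S) < \epsilon)} \leq \frac{\delta}{1 - e^{-2m\epsilon^2}}. \]

Let $\epsilon = \frac{1}{\sqrt{m}}$ and then let $\delta' = \frac{\delta}{1 - e^{-2}}$; we get $\frac{\delta}{1 - e^{-2m\epsilon^2}} = \frac{\delta}{1 - e^{-2}} = \delta'$, that is, $\delta = \delta'(1 - e^{-2})$. The theorem follows by substituting $\delta$ with $\delta'$ and dividing by $(1 - \alpha)$ on both sides of the inequality inside of the probability. ■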
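
The choice $\epsilon = 1/\sqrt{m}$ makes the Chernoff term independent of the sample size, since $2m\epsilon^2 = 2$. A minimal numerical check of this step is sketched below; the helper name `chernoff_mass` is ours and is only meant to illustrate the constant $1 - e^{-2}$ that appears in the bound.

```python
import math

def chernoff_mass(m):
    # Lower bound 1 - exp(-2 m eps^2) on P(B(S) < eps), with eps = 1/sqrt(m).
    eps = 1.0 / math.sqrt(m)
    return 1.0 - math.exp(-2.0 * m * eps ** 2)

if __name__ == "__main__":
    for m in (10, 1_000, 1_000_000):
        # The value does not depend on m (up to floating-point rounding).
        print(m, chernoff_mass(m), 1.0 / chernoff_mass(m))
```

The reciprocal, about 1.157, is the factor by which $\delta$ is inflated to $\delta'$, so the extra cost inside the logarithm of the bound is a small additive constant.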